The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('/Users/ritima/Desktop/Project Unsupervised Learning/stock_data.csv')
# copying data to another varaible to avoid any changes to original data
data = df.copy()
# Checking the number of rows and columns in the train data
data.shape
(340, 15)
# Viewing a random sample of the dataset
data.sample(n=10, random_state=1)
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | DVN | Devon Energy Corp. | Energy | Oil & Gas Exploration & Production | 32.000 | -15.478 | 2.924 | 205 | 70 | 830000000 | -14454000000 | -35.550 | 406582278.500 | 93.089 | 1.786 |
| 125 | FB | Information Technology | Internet Software & Services | 104.660 | 16.224 | 1.321 | 8 | 958 | 592000000 | 3669000000 | 1.310 | 2800763359.000 | 79.893 | 5.884 | |
| 11 | AIV | Apartment Investment & Mgmt | Real Estate | REITs | 40.030 | 7.579 | 1.163 | 15 | 47 | 21818000 | 248710000 | 1.520 | 163625000.000 | 26.336 | -1.269 |
| 248 | PG | Procter & Gamble | Consumer Staples | Personal Products | 79.410 | 10.661 | 0.806 | 17 | 129 | 160383000 | 636056000 | 3.280 | 491391569.000 | 24.070 | -2.257 |
| 238 | OXY | Occidental Petroleum | Energy | Oil & Gas Exploration & Production | 67.610 | 0.865 | 1.590 | 32 | 64 | -588000000 | -7829000000 | -10.230 | 765298142.700 | 93.089 | 3.345 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516 | -8.699 | 1.479 | 142 | 27 | 159000000 | 1293000000 | 2.970 | 435353535.400 | 17.682 | -3.838 |
| 112 | EQT | EQT Corporation | Energy | Oil & Gas Exploration & Production | 52.130 | -21.254 | 2.365 | 2 | 201 | 523803000 | 85171000 | 0.560 | 152091071.400 | 93.089 | 9.568 |
| 147 | HAL | Halliburton Co. | Energy | Oil & Gas Equipment & Services | 34.040 | -5.102 | 1.966 | 4 | 189 | 7786000000 | -671000000 | -0.790 | 849367088.600 | 93.089 | 17.346 |
| 89 | DFS | Discover Financial Services | Financials | Consumer Finance | 53.620 | 3.654 | 1.160 | 20 | 99 | 2288000000 | 2297000000 | 5.140 | 446887159.500 | 10.432 | -0.376 |
| 173 | IVZ | Invesco Ltd. | Financials | Asset Management & Custody Banks | 33.480 | 7.067 | 1.581 | 12 | 67 | 412000000 | 968100000 | 2.260 | 428362831.900 | 14.814 | 4.219 |
# checking the column names and datatypes
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
# Viewing the statistical summary of the numerical columns in the data
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.000 | 80.862 | 98.055 | 4.500 | 38.555 | 59.705 | 92.880 | 1274.950 |
| Price Change | 340.000 | 4.078 | 12.006 | -47.130 | -0.939 | 4.820 | 10.695 | 55.052 |
| Volatility | 340.000 | 1.526 | 0.592 | 0.733 | 1.135 | 1.386 | 1.696 | 4.580 |
| ROE | 340.000 | 39.597 | 96.548 | 1.000 | 9.750 | 15.000 | 27.000 | 917.000 |
| Cash Ratio | 340.000 | 70.024 | 90.421 | 0.000 | 18.000 | 47.000 | 99.000 | 958.000 |
| Net Cash Flow | 340.000 | 55537620.588 | 1946365312.176 | -11208000000.000 | -193906500.000 | 2098000.000 | 169810750.000 | 20764000000.000 |
| Net Income | 340.000 | 1494384602.941 | 3940150279.328 | -23528000000.000 | 352301250.000 | 707336000.000 | 1899000000.000 | 24442000000.000 |
| Earnings Per Share | 340.000 | 2.777 | 6.588 | -61.200 | 1.558 | 2.895 | 4.620 | 50.090 |
| Estimated Shares Outstanding | 340.000 | 577028337.754 | 845849595.418 | 27672156.860 | 158848216.100 | 309675137.800 | 573117457.325 | 6159292035.000 |
| P/E Ratio | 340.000 | 32.613 | 44.349 | 2.935 | 15.045 | 20.820 | 31.765 | 528.039 |
| P/B Ratio | 340.000 | -1.718 | 13.967 | -76.119 | -4.352 | -1.067 | 3.917 | 129.065 |
# Viewing the statistical summary of the numerical columns in the data
data.describe(include=[object]).T
| count | unique | top | freq | |
|---|---|---|---|---|
| Ticker Symbol | 340 | 340 | AAL | 1 |
| Security | 340 | 340 | American Airlines Group | 1 |
| GICS Sector | 340 | 11 | Industrials | 53 |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 |
Current Price, ROE, Cash Ratio, P/E Ratio and P/B Ratio.average stock price is around 80 dollars, while the maximum is 1275 dollars.Cash Ratio is 0, indicating a scenario where a company's total reserves of cash and cash equivalents were equal to its total current liabilities, i.e., the company was not cash positive.Net Cash Flow, which means a company's cash outflows exceeded its cash inflows.Net Income has a negative minimum value, suggesting possible losses for a company/companies. Likewise, Earnings Per Share, which is the company's net profit/loss divided by the number of common shares it has outstanding has a minimum value of -61 dollars.P/E Ratio is 32, while the maximum is 528, indicating outliers at the right end.P/B Ratio, which is the ratio of a company's stock price per share by its book value per share, is around -2 and median is -1, indicating suboptimal performance of company stocks.Ticker Symbol and Security are unique i.d. columns and hence should be dropped from further analysis.GICS Sector has 11 unique values, where the maximum frequency is 53 for the sector - Industrials. this means that out of 340 companies, 53 belong to the GICS Sector - Industrials.GICS Sub Industry variable has 104 unique values, with Oil & Gas Exploration & Production having the maximum frequency of 16 (companies belonging to this sub-industry group).Questions:
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="pink"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Creating histogram_boxplot for 'Current Price'
histogram_boxplot(data, 'Current Price')
# Checking the observations which have current price more than 600 dollars
data.loc[data["Current Price"] > 600]
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | AMZN | Amazon.com Inc | Consumer Discretionary | Internet & Direct Marketing Retail | 675.890 | 32.268 | 1.460 | 4 | 58 | 1333000000 | 596000000 | 1.280 | 465625000.000 | 528.039 | 3.904 |
| 243 | PCLN | Priceline.com Inc | Consumer Discretionary | Internet & Direct Marketing Retail | 1274.950 | 3.191 | 1.268 | 29 | 184 | -1671386000 | 2551360000 | 50.090 | 50935516.070 | 25.453 | -1.052 |
# Creating histogram_boxplot for 'Price Change'
histogram_boxplot(data, 'Price Change')
# Creating histogram_boxplot for 'Volatility'
histogram_boxplot(data, 'Volatility')
# Creating histogram_boxplot for 'ROE'
histogram_boxplot(data, 'ROE')
# Creating histogram_boxplot for 'Cash Ratio'
histogram_boxplot(data, 'Cash Ratio')
# Creating histogram_boxplot for 'Net Cash Flow'
histogram_boxplot(data, 'Net Cash Flow')
# Creating histogram_boxplot for 'Net Income'
histogram_boxplot(data, 'Net Income')
# Creating histogram_boxplot for 'Earnings Per Share'
histogram_boxplot(data, 'Earnings Per Share')
# Creating histogram_boxplot for 'Estimated Shares Outstanding'
histogram_boxplot(data, 'Estimated Shares Outstanding')
# Creating histogram_boxplot for 'P/E Ratio'
histogram_boxplot(data, 'P/E Ratio')
# Creating histogram_boxplot for 'P/B Ratio'
histogram_boxplot(data, 'P/B Ratio')
Current Price is highly skewed to the right, which indicates the presence of very high stock prices. The distributions of ROE and Cash Ratio exhibit a similar trend.Net Cash Flow, Net Income and Earnings Per Share have a lot of outliers, suggesting higher spread among the values.# selecting numerical columns
num_col = data.select_dtypes(include=np.number).columns.tolist()
fig, axes = plt.subplots(3, 2, figsize=(20, 15))
fig.suptitle("CDF plot of numerical variables", fontsize=20)
counter = 0
for ii in range(3):
sns.ecdfplot(ax=axes[ii][0], x=df[num_col[counter]])
counter = counter + 1
if counter != 5:
sns.ecdfplot(ax=axes[ii][1], x=df[num_col[counter]])
counter = counter + 1
else:
pass
fig.tight_layout(pad=2.0)
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Creating a labelled barplot for 'GICS Sector'
labeled_barplot(data, 'GICS Sector', perc=True)
# Creating a labelled barplot for 'GICS Sub Industry'
labeled_barplot(data, 'GICS Sub Industry', perc=True)
# Cross-tabulating by industry sectors and sub-industry groups
ss = pd.crosstab(index=data["GICS Sub Industry"], columns=data["GICS Sector"])
ss
| GICS Sector | Consumer Discretionary | Consumer Staples | Energy | Financials | Health Care | Industrials | Information Technology | Materials | Real Estate | Telecommunications Services | Utilities |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GICS Sub Industry | |||||||||||
| Advertising | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Aerospace & Defense | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Agricultural Products | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Air Freight & Logistics | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Airlines | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| Alternative Carriers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Apparel, Accessories & Luxury Goods | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Application Software | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Asset Management & Custody Banks | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Auto Parts & Equipment | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Automobile Manufacturers | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Banks | 0 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Biotechnology | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| Brewers | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Broadcasting & Cable TV | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Building Products | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Cable & Satellite | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Casinos & Gaming | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Computer Hardware | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Construction & Farm Machinery & Heavy Trucks | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Construction Materials | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Consumer Electronics | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Consumer Finance | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Copper | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Data Processing & Outsourced Services | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Distributors | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Diversified Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 |
| Diversified Commercial Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Diversified Financial Services | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Drug Retail | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Electric Utilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 |
| Electrical Components & Equipment | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Electronic Components | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| Electronic Equipment & Instruments | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Environmental Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Fertilizers & Agricultural Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Financial Exchanges & Data | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gold | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Health Care Distributors | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Equipment | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Facilities | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| Health Care Supplies | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| Home Entertainment Software | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Home Furnishings | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Homebuilding | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hotels, Resorts & Cruise Lines | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Household Appliances | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Household Products | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Housewares & Specialties | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Human Resource & Employment Services | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| IT Consulting & Other Services | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| Industrial Conglomerates | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 |
| Industrial Gases | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Industrial Machinery | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 |
| Industrial Materials | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Insurance Brokers | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Integrated Oil & Gas | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Integrated Telecommunications Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| Internet & Direct Marketing Retail | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Internet Software & Services | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 |
| Investment Banking & Brokerage | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Leisure Products | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Life & Health Insurance | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Life Sciences Tools & Services | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Managed Health Care | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| Metal & Glass Containers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Motorcycle Manufacturers | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Multi-Sector Holdings | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Multi-line Insurance | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| MultiUtilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| Networking Equipment | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Office REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Oil & Gas Equipment & Services | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oil & Gas Exploration & Production | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Oil & Gas Refining & Marketing & Transportation | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Packaged Foods & Meats | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Paper Packaging | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
| Personal Products | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pharmaceuticals | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| Property & Casualty Insurance | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Publishing | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 |
| Railroads | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Real Estate Services | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Regional Banks | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Research & Consulting Services | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Residential REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| Restaurants | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Retail REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 |
| Semiconductor Equipment | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Semiconductors | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 |
| Soft Drinks | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Specialized REITs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
| Specialty Chemicals | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 |
| Specialty Retail | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Specialty Stores | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Steel | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Technology Hardware, Storage & Peripherals | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Technology, Hardware, Software and Supplies | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Thrifts & Mortgage Finance | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tires & Rubber | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Tobacco | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Trucking | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Water Utilities | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# Creating Stacked Bar Chart for industry sector and sub-industry group
ss.plot(
kind="bar",
colormap="nipy_spectral",
figsize=(22, 8),
rot=1,
stacked=True,
linewidth=2,
edgecolor="navy",
fontsize=14,)
plt.title("Industry Sectors and Sub-Industry Groups", fontsize=10) # Setting plot title
plt.xticks(rotation=90, fontsize=8)
plt.xlabel("GICS Sub Industry", fontsize=10)
Text(0.5, 0, 'GICS Sub Industry')
16% (i.e. 53 companies) of the 340 companies in the dataset belonged to the Industrials sector. This was followed by 14.4% (i.e. 49 companies) in Financials sector and 11.8% (i.e. 40 companies) each in Consumer Discretionary & Health Care sectors.1.5% (5 companies) fell under the Telecommunications Services sector.(4.7%) belonged to the Oil & Gas Exploration & Production, followed by 4.1% in the Real Estate Investment Boards (REITs) and Industrial Conglomerates groups each and 3.5% in the Internet software & Services and Electric Utilities groups each.# Current Price and security/company
cpc = data.groupby(["Security", "GICS Sector", "GICS Sub Industry"])["Current Price"].mean().reset_index()
cpc.sort_values('Current Price', ascending=False)
| Security | GICS Sector | GICS Sub Industry | Current Price | |
|---|---|---|---|---|
| 250 | Priceline.com Inc | Consumer Discretionary | Internet & Direct Marketing Retail | 1274.950 |
| 18 | Amazon.com Inc | Consumer Discretionary | Internet & Direct Marketing Retail | 675.890 |
| 173 | Intuitive Surgical Inc. | Health Care | Health Care Equipment | 546.160 |
| 261 | Regeneron | Health Care | Biotechnology | 542.870 |
| 75 | Chipotle Mexican Grill | Consumer Discretionary | Restaurants | 479.850 |
| ... | ... | ... | ... | ... |
| 37 | Arconic Inc | Industrials | Aerospace & Defense | 7.399 |
| 278 | Southwestern Energy | Energy | Oil & Gas Exploration & Production | 7.110 |
| 141 | Freeport-McMoran Cp & Gld | Materials | Copper | 6.770 |
| 142 | Frontier Communications | Telecommunications Services | Integrated Telecommunications Services | 4.670 |
| 73 | Chesapeake Energy | Energy | Integrated Oil & Gas | 4.500 |
340 rows × 4 columns
# Looking at the top 10 entries for Current Price and security/company
cpcc = cpc.sort_values('Current Price', ascending=False).head(10)
cpcc
| Security | GICS Sector | GICS Sub Industry | Current Price | |
|---|---|---|---|---|
| 250 | Priceline.com Inc | Consumer Discretionary | Internet & Direct Marketing Retail | 1274.950 |
| 18 | Amazon.com Inc | Consumer Discretionary | Internet & Direct Marketing Retail | 675.890 |
| 173 | Intuitive Surgical Inc. | Health Care | Health Care Equipment | 546.160 |
| 261 | Regeneron | Health Care | Biotechnology | 542.870 |
| 75 | Chipotle Mexican Grill | Consumer Discretionary | Restaurants | 479.850 |
| 209 | Mettler Toledo | Health Care | Life Sciences Tools & Services | 339.130 |
| 43 | BIOGEN IDEC Inc. | Health Care | Biotechnology | 306.350 |
| 119 | Equinix | Real Estate | REITs | 302.400 |
| 14 | Alliance Data Systems | Information Technology | Data Processing & Outsourced Services | 276.570 |
| 273 | Sherwin-Williams | Materials | Specialty Chemicals | 259.600 |
# Creating barplot of current price and company/security (head)
plt.figure(figsize=(12, 6))
sns.barplot(data=cpcc, x="Security", y="Current Price", palette="viridis", ci=None)
plt.title("Current Price by Security (Top Companies)", fontsize=14) # Setting plot title
plt.xlabel("Security", fontsize=12)
plt.xticks(rotation=90)
plt.show()
Priceline.com Inc is currently priced the highest at ~1275 dollarsAmazon.com Inc is second, with a stock price of ~675 dollarsIntuitive Surgical Inc. and Regeneron come in next, both with a current stock price of higher than 500 dollars# Looking at the last 10 entries for Current Price and security/company
cpcb = cpc.sort_values('Current Price', ascending=False).tail(10)
cpcb
| Security | GICS Sector | GICS Sub Industry | Current Price | |
|---|---|---|---|---|
| 196 | Marathon Oil Corp. | Energy | Oil & Gas Exploration & Production | 12.590 |
| 153 | HP Inc. | Information Technology | Computer Hardware | 11.840 |
| 220 | Navient | Financials | Consumer Finance | 11.450 |
| 165 | Huntington Bancshares | Financials | Banks | 11.060 |
| 332 | Xerox Corp. | Information Technology | IT Consulting & Other Services | 10.630 |
| 37 | Arconic Inc | Industrials | Aerospace & Defense | 7.399 |
| 278 | Southwestern Energy | Energy | Oil & Gas Exploration & Production | 7.110 |
| 141 | Freeport-McMoran Cp & Gld | Materials | Copper | 6.770 |
| 142 | Frontier Communications | Telecommunications Services | Integrated Telecommunications Services | 4.670 |
| 73 | Chesapeake Energy | Energy | Integrated Oil & Gas | 4.500 |
# Creating barplot of current price and company/security (tail)
plt.figure(figsize=(12, 6))
sns.barplot(data=cpcb, x="Security", y="Current Price", palette="viridis", ci=None)
plt.title("Current Price by Security (Bottom Securities)", fontsize=14) # Setting plot title
plt.xlabel("Security", fontsize=12)
plt.xticks(rotation=90)
plt.show()
Chesapeake Energy and Frontier Communications are priced the lowest at ~5 dollarsFreeport-McMoran Cp & Gld, Southwestern Energy and Arconic Inc at ~7 dollars# Price Change and security/company
cpsec = data.groupby(["Security"])["Price Change"].mean().reset_index()
cpsec.sort_values('Price Change', ascending=False)
| Security | Price Change | |
|---|---|---|
| 134 | First Solar Inc | 55.052 |
| 106 | Du Pont (E.I.) | 37.490 |
| 299 | TripAdvisor | 34.804 |
| 217 | Mylan N.V. | 33.177 |
| 18 | Amazon.com Inc | 32.268 |
| ... | ... | ... |
| 141 | Freeport-McMoran Cp & Gld | -31.685 |
| 75 | Chipotle Mexican Grill | -33.131 |
| 73 | Chesapeake Energy | -38.102 |
| 278 | Southwestern Energy | -44.798 |
| 183 | Kinder Morgan | -47.130 |
340 rows × 2 columns
First Solar Inc, Du Pont (E.I.), TripAdvisor, Mylan N.V. and Amazon.com Inc saw the maximum percentage increase in stock prices over a 13 week periodKinder Morgan, Southwestern Energy, Chesapeake Energy, Chipotle Mexican Grill and Freeport-McMoran Cp & Gld witnessed the maximum percentage decrease in stock prices over the 13 week period# Volatility and security/company
vs = data.groupby(["Security"])["Volatility"].mean().reset_index()
vs.sort_values('Volatility', ascending=False)
| Security | Volatility | |
|---|---|---|
| 278 | Southwestern Energy | 4.580 |
| 73 | Chesapeake Energy | 4.560 |
| 141 | Freeport-McMoran Cp & Gld | 3.796 |
| 329 | Wynn Resorts Ltd | 3.795 |
| 327 | Williams Cos. | 3.720 |
| ... | ... | ... |
| 262 | Republic Services Inc | 0.840 |
| 306 | United Parcel Service | 0.826 |
| 252 | Procter & Gamble | 0.806 |
| 243 | PepsiCo Inc. | 0.805 |
| 205 | McDonald's Corp. | 0.733 |
340 rows × 2 columns
Southwestern Energy, Chesapeake Energy and Freeport-McMoran Cp & Gld witnesed the maximum volatility in stock prices over a 13 week period.McDonald's Corp., PepsiCo Inc and Procter & Gamble witnessed the least volatility in stock prices over a 13 week period.# Current Price and industry sector
cps = data.groupby(["GICS Sector"])["Current Price"].mean().sort_values(ascending=False)
cps
GICS Sector Health Care 132.048 Consumer Discretionary 128.095 Real Estate 90.977 Materials 76.552 Industrials 74.412 Consumer Staples 71.973 Information Technology 63.548 Financials 58.659 Utilities 52.969 Energy 46.042 Telecommunications Services 32.964 Name: Current Price, dtype: float64
# Creating barplot of current price and industry sector
plt.figure(figsize=(12, 6))
sns.barplot(data=data, x="GICS Sector", y="Current Price", palette="viridis", ci=None)
plt.title("Current Price by Industry Sector", fontsize=14) # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()
Health Care sector have the highest average current price (~ 132 dollars), followed closely by the Consumer Discretionary sector (~ 128 dollars)Telecommunications Services (~ 33 dollars) and Energy (~ 46 dollars) sectors# Current Price and sub-industry group
cpsb = data.groupby(["GICS Sub Industry"])["Current Price"].mean().sort_values(ascending=False)
cpsb
GICS Sub Industry Internet & Direct Marketing Retail 540.097 Life Sciences Tools & Services 339.130 Biotechnology 221.297 Restaurants 216.835 Industrial Materials 202.590 Home Furnishings 189.390 Data Processing & Outsourced Services 156.385 Household Appliances 146.870 Residential REITs 137.980 Specialty Chemicals 137.407 Specialty Stores 132.990 Managed Health Care 129.546 Health Care Equipment 127.925 Aerospace & Defense 126.625 Health Care Distributors 120.520 Construction Materials 115.775 Office REITs 112.980 Industrial Gases 102.400 Drug Retail 97.770 Health Care Supplies 97.525 Brewers 93.920 Asset Management & Custody Banks 92.937 Financial Exchanges & Data 90.600 Health Care Facilities 90.322 Specialized REITs 86.927 Specialty Retail 85.500 REITs 85.035 Research & Consulting Services 84.695 Industrial Conglomerates 81.727 Apparel, Accessories & Luxury Goods 80.610 Multi-line Insurance 80.540 Broadcasting & Cable TV 80.145 Personal Products 79.410 Household Products 78.787 Cable & Satellite 78.333 Retail REITs 76.408 Hotels, Resorts & Cruise Lines 73.845 Trucking 73.360 Tobacco 73.060 Metal & Glass Containers 72.730 Semiconductors 72.092 Diversified Chemicals 71.792 Soft Drinks 71.433 IT Consulting & Other Services 69.423 Casinos & Gaming 69.190 Diversified Financial Services 69.019 Air Freight & Logistics 67.783 Internet Software & Services 67.712 Industrial Machinery 66.722 Property & Casualty Insurance 65.922 Railroads 65.852 Packaged Foods & Meats 65.248 Auto Parts & Equipment 64.480 Construction & Farm Machinery & Heavy Trucks 63.877 Insurance Brokers 62.867 Application Software 60.180 Water Utilities 59.750 Oil & Gas Refining & Marketing & Transportation 58.097 Pharmaceuticals 55.098 Life & Health Insurance 55.090 Airlines 54.782 Alternative Carriers 54.360 Banks 53.839 Electrical Components & Equipment 53.590 MultiUtilities 53.498 Environmental Services 53.370 Electric Utilities 51.918 Advertising 49.470 Integrated Oil & Gas 48.668 Building Products 47.635 Consumer Finance 47.442 Leisure Products 47.265 Diversified Commercial Services 47.220 Human Resource & Employment Services 47.140 Motorcycle Manufacturers 45.390 Housewares & Specialties 44.080 Oil & Gas Exploration & Production 42.229 Paper Packaging 41.150 Steel 40.300 Home Entertainment Software 38.710 Oil & Gas Equipment & Services 37.893 Consumer Electronics 37.170 Agricultural Products 36.680 Electronic Components 35.255 Real Estate Services 34.580 Fertilizers & Agricultural Chemicals 34.200 Homebuilding 33.365 Tires & Rubber 32.670 Regional Banks 31.773 Investment Banking & Brokerage 31.285 Distributors 29.630 Electronic Equipment & Instruments 28.070 Integrated Telecommunications Services 27.615 Networking Equipment 27.600 Publishing 25.520 Automobile Manufacturers 24.050 Technology, Hardware, Software and Supplies 20.650 Semiconductor Equipment 18.670 Gold 17.990 Multi-Sector Holdings 17.390 Thrifts & Mortgage Finance 16.150 Technology Hardware, Storage & Peripherals 15.200 Computer Hardware 11.840 Copper 6.770 Name: Current Price, dtype: float64
# Creating barplot of current price and sub-industry group
plt.figure(figsize=(28, 6))
sns.barplot(data=data, x="GICS Sub Industry", y="Current Price", palette="viridis", ci=None)
plt.title("Current Price by Industry Sector", fontsize=14) # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=12)
plt.xticks(rotation=90)
plt.show()
With respect to the sub-industry groups, stocks belonging to the Internet & Direct Marketing Retail sub-group (Consumer Discretionary industry sector) have the highest average current price (~ 540 dollars)
Next is the Life Sciences Tools & Services sub-group and Biotechnology sub-group at ~ 339 dollars and ~ 221 dollars respectively, both of which belong to the Health Care industry group
Restaurants sub-group (Consumer Discretionary industry sector) comes in next with an average current stock price of ~ 217 dollars
On the other hand, the stocks of Copper sub-group (Materials industry sector) have the lowest average current stock price of ~7 dollars
Next on the lowest average current price are stocks of the Computer Hardware and Technology Hardware, Storage & Peripherals sub-groups at ~ 12 dollars and ~ 15 dollars respectively, both belonging to the Information Technology industry sector
Thrifts & Mortgage Finance and Multi-Sector Holdings sub-groups (both belonging to the Financials industry sector) come in next with average current stock prices of ~ 16 dollars and ~ 17 dollars respectively
# Price change and industry sector
pcs = data.groupby(["GICS Sector"])["Price Change"].mean().sort_values(ascending=False)
pcs
GICS Sector Health Care 9.586 Consumer Staples 8.685 Information Technology 7.217 Telecommunications Services 6.957 Real Estate 6.206 Consumer Discretionary 5.846 Materials 5.590 Financials 3.865 Industrials 2.833 Utilities 0.804 Energy -10.228 Name: Price Change, dtype: float64
# Creating lineplot of price change and industry sector
plt.figure(figsize=(12, 6))
sns.lineplot(data=data, x="GICS Sector", y="Price Change", palette="viridis")
plt.title("Price Change by Industry Sector", fontsize=14) # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()
# Price change and sub-industry group
pcsb = data.groupby(["GICS Sub Industry"])["Price Change"].mean().sort_values(ascending=False)
pcsb
GICS Sub Industry Casinos & Gaming 29.497 Semiconductor Equipment 26.834 Alternative Carriers 24.708 Home Entertainment Software 23.320 Life Sciences Tools & Services 18.943 Internet & Direct Marketing Retail 18.789 Advertising 18.316 Metal & Glass Containers 16.536 Health Care Distributors 16.108 Diversified Chemicals 15.593 Specialty Chemicals 15.436 Semiconductors 14.437 Investment Banking & Brokerage 14.060 Publishing 13.624 Biotechnology 13.497 Building Products 13.298 Specialized REITs 13.243 Brewers 13.129 Data Processing & Outsourced Services 11.823 Leisure Products 11.774 Packaged Foods & Meats 11.604 Pharmaceuticals 11.525 Gold 10.844 Diversified Commercial Services 10.819 Personal Products 10.661 Tires & Rubber 10.446 Soft Drinks 10.434 Housewares & Specialties 9.980 Airlines 9.502 Health Care Equipment 9.331 Tobacco 8.607 Water Utilities 8.597 Real Estate Services 8.198 Auto Parts & Equipment 7.790 Household Products 7.780 Residential REITs 7.707 Networking Equipment 7.351 Automobile Manufacturers 7.340 Property & Casualty Insurance 7.145 Broadcasting & Cable TV 7.144 Environmental Services 7.061 Steel 6.586 Managed Health Care 6.219 Banks 6.083 Internet Software & Services 5.993 Retail REITs 5.906 Hotels, Resorts & Cruise Lines 5.348 Health Care Supplies 5.115 Electronic Components 4.641 Distributors 4.441 REITs 4.369 Research & Consulting Services 4.206 Diversified Financial Services 4.203 Aerospace & Defense 4.136 Office REITs 4.004 Technology, Hardware, Software and Supplies 3.821 Regional Banks 3.672 Home Furnishings 3.514 Consumer Electronics 3.394 Industrial Conglomerates 3.275 Thrifts & Mortgage Finance 3.129 Insurance Brokers 3.109 Cable & Satellite 3.066 Trucking 2.961 Application Software 2.572 Integrated Telecommunications Services 2.519 Specialty Stores 2.449 Electrical Components & Equipment 2.212 Computer Hardware 2.162 Multi-line Insurance 1.898 Life & Health Insurance 1.854 Health Care Facilities 1.712 Drug Retail 1.327 Specialty Retail 1.303 MultiUtilities 1.176 Asset Management & Custody Banks 0.958 Industrial Gases 0.294 Electronic Equipment & Instruments 0.214 Industrial Machinery 0.010 IT Consulting & Other Services -0.157 Electric Utilities -0.187 Household Appliances -0.231 Construction & Farm Machinery & Heavy Trucks -0.605 Consumer Finance -0.838 Homebuilding -1.930 Oil & Gas Refining & Marketing & Transportation -2.370 Financial Exchanges & Data -2.402 Construction Materials -2.419 Paper Packaging -2.587 Industrial Materials -5.336 Air Freight & Logistics -5.418 Railroads -6.407 Integrated Oil & Gas -6.955 Restaurants -7.297 Human Resource & Employment Services -7.659 Oil & Gas Equipment & Services -9.991 Fertilizers & Agricultural Chemicals -10.240 Agricultural Products -12.017 Oil & Gas Exploration & Production -14.242 Multi-Sector Holdings -14.293 Apparel, Accessories & Luxury Goods -16.948 Motorcycle Manufacturers -17.247 Technology Hardware, Storage & Peripherals -17.838 Copper -31.685 Name: Price Change, dtype: float64
# Creating lineplot of price change and sub-industry group
plt.figure(figsize=(28, 6))
sns.lineplot(data=data, x="GICS Sub Industry", y="Price Change", palette="viridis")
plt.title("Price Change by Sub-Industry Group", fontsize=14) # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()
Health Care sector have the highest percentage price change (increase) of ~ 9.5%, followed by the Consumer Staples sector at ~ 8.7%Information Technology, Telecommunications Services and Real Estate sectors with percentage price change of ~ 7.2%, ~ 6.9% and ~ 6.2% respectivelyEnergy sector saw a major price change (decrease) to the tune of ~ -10.2%Casinos & Gaming sub-group witnessed the highest percentage price change (increase) of ~ 29.5%, which is expected from this sector.Semiconductor Equipment, Alternative Carriers and Home Entertainment Software sub-sectors at ~ 27%, ~ 25% and ~ 23% respectively.Copper sub-group (Materials industry group) saw a major price change (decrease) to the tune of ~ -32%, followed by Technology Hardware, Storage & Peripherals, Motorcycle Manufacturers and Apparel, Accessories & Luxury Goods at ~ 18%, ~17.2% and ~17% respectively.# Creating lineplot of volatility and industry sector
plt.figure(figsize=(12, 6))
sns.lineplot(data=data, x="GICS Sector", y="Volatility", palette="viridis")
plt.title("Volatility by Industry Sector", fontsize=14) # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()
# Creating lineplot of volatility and sub-industry group
plt.figure(figsize=(28, 6))
sns.lineplot(data=data, x="GICS Sub Industry", y="Volatility", palette="viridis")
plt.title("Volatility by Sub-Industry Group", fontsize=14) # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()
Energy sector have the highest volatility, of ~ 2.6, in the stock prices over the past 13 weeksMaterials, Information Technology, and Consumer Discretionary at 1.8, 1.7 and 1.6 respectivelyUtilities and Consumer Staples exhibited the least volatility in stock prices over the past 13 weeksCasinos & Gaming and Copper sub-groups witnessed the highest volatility, of ~ 3.8, in the stock prices over the past 13 weeksTechnology Hardware, Storage & Peripherals and Oil & Gas Exploration & Production at ~3.4 and ~ 2.9 respectively.Personal Products (Consumer Staples industry group) exhibited the least volatility in stock prices over the past 13 weeks# Cash ratio and industry sector
crs = data.groupby(["GICS Sector"])["Cash Ratio"].mean().sort_values(ascending=False)
crs
GICS Sector Information Technology 149.818 Telecommunications Services 117.000 Health Care 103.775 Financials 98.592 Consumer Staples 70.947 Energy 51.133 Real Estate 50.111 Consumer Discretionary 49.575 Materials 41.700 Industrials 36.189 Utilities 13.625 Name: Cash Ratio, dtype: float64
# Creating pointplot of cash ratio and industry sector
plt.figure(figsize=(12, 6))
sns.pointplot(data=data, x="GICS Sector", y="Cash Ratio", palette="viridis")
plt.title("Cash Ratio by Industry Sector", fontsize=14) # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()
# Cash ratio and sub-industry group
crsb = data.groupby(["GICS Sub Industry"])["Cash Ratio"].mean().sort_values(ascending=False)
crsb
GICS Sub Industry Biotechnology 217.286 Internet Software & Services 205.417 Gold 198.000 Casinos & Gaming 198.000 Soft Drinks 192.000 Semiconductors 189.667 Restaurants 174.667 Electronic Components 169.000 Health Care Distributors 160.333 Steel 147.000 Application Software 134.000 Integrated Telecommunications Services 131.000 Semiconductor Equipment 131.000 Personal Products 129.000 Consumer Electronics 121.000 Internet & Direct Marketing Retail 120.750 Diversified Financial Services 117.000 Health Care Equipment 108.364 Networking Equipment 108.000 Oil & Gas Equipment & Services 107.333 Multi-line Insurance 99.000 Insurance Brokers 99.000 Financial Exchanges & Data 99.000 Investment Banking & Brokerage 99.000 Life & Health Insurance 99.000 Thrifts & Mortgage Finance 99.000 Regional Banks 99.000 Banks 99.000 Consumer Finance 99.000 Property & Casualty Insurance 99.000 Multi-Sector Holdings 81.000 Electronic Equipment & Instruments 81.000 IT Consulting & Other Services 77.333 Diversified Commercial Services 73.000 Leisure Products 73.000 Home Entertainment Software 70.000 Managed Health Care 70.000 Asset Management & Custody Banks 67.000 Construction Materials 63.500 Pharmaceuticals 62.500 Alternative Carriers 61.000 REITs 57.214 Oil & Gas Exploration & Production 52.812 Construction & Farm Machinery & Heavy Trucks 52.333 Air Freight & Logistics 50.000 Agricultural Products 49.000 Airlines 47.800 Industrial Machinery 47.800 Residential REITs 47.000 Retail REITs 47.000 Office REITs 47.000 Railroads 46.000 Technology Hardware, Storage & Peripherals 45.000 Fertilizers & Agricultural Chemicals 43.500 Building Products 43.000 Integrated Oil & Gas 40.400 Specialized REITs 39.000 Automobile Manufacturers 38.000 Packaged Foods & Meats 37.667 Electrical Components & Equipment 37.000 Brewers 35.000 Technology, Hardware, Software and Supplies 34.000 Human Resource & Employment Services 34.000 Health Care Supplies 31.500 Diversified Chemicals 31.200 Motorcycle Manufacturers 31.000 Health Care Facilities 30.200 Tires & Rubber 30.000 Industrial Conglomerates 28.929 Oil & Gas Refining & Marketing & Transportation 27.500 Tobacco 27.500 Apparel, Accessories & Luxury Goods 27.000 Household Products 25.000 Homebuilding 25.000 Data Processing & Outsourced Services 25.000 Broadcasting & Cable TV 24.500 Paper Packaging 23.000 Aerospace & Defense 22.750 Research & Consulting Services 21.000 Publishing 21.000 Auto Parts & Equipment 19.500 Advertising 19.000 Life Sciences Tools & Services 18.000 Computer Hardware 18.000 Cable & Satellite 16.667 MultiUtilities 16.273 Industrial Materials 16.000 Housewares & Specialties 14.000 Specialty Chemicals 12.500 Real Estate Services 12.000 Electric Utilities 12.000 Distributors 12.000 Drug Retail 11.000 Hotels, Resorts & Cruise Lines 10.250 Metal & Glass Containers 10.000 Household Appliances 10.000 Specialty Retail 10.000 Industrial Gases 8.000 Copper 5.000 Water Utilities 4.000 Specialty Stores 3.333 Home Furnishings 3.000 Environmental Services 2.000 Trucking 1.000 Name: Cash Ratio, dtype: float64
# Creating pointplot of cash ratio and sub-industry group
plt.figure(figsize=(28, 6))
sns.pointplot(data=data, x="GICS Sub Industry", y="Cash Ratio", palette="viridis")
plt.title("Cash Ratio by Sub-Industry Group", fontsize=14) # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()
Information Technology sector have the highest cash ratio of ~ 150. This suggests that companies belonging to this sector have the ability to cover their short-term obligations using only cash and cash equivalentsTelecommunications Services and Health Care at ~ 117 and ~ 104 respectivelyUtilities and Industrials sectors have the lowest cash ratios of 14 and 36 respectively, which indicates that majority of companies in these sectors have low liquidityBiotachnology sub-group have the highest cash ratio of ~ 217, followed closely by Internet Software & Services at ~ 205Casinos & Gaming and Gold sub-groups at ~ 198 each and Soft Drinks at ~ 192Trucking sub-group (Industrials sector) had the lowest cash ratio of ~ 1, Environmental Services stood at ~ 2, whereas it was ~ 3 each for Specality Stores and Home Furnishings sub-groups# Creating barplot of net income and industry sector
plt.figure(figsize=(12, 6))
sns.barplot(data=data, x="GICS Sector", y="Net Income", palette="viridis", ci=None)
plt.title("Net Income by Industry Sector", fontsize=14) # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()
# Creating barplot of net income and sub-industry group
plt.figure(figsize=(28, 6))
sns.barplot(data=data, x="GICS Sub Industry", y="Net Income", palette="viridis", ci=None)
plt.title("Net Income by Sub-Industry Group", fontsize=14) # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=12)
plt.xticks(rotation=90)
plt.show()
Telecommunications Services sector have the highest net income, followed by Financials, Consumer Staples and Health Care sectorsEnergy sector exhibited negative net income, signifying potential lossesBanks sub-group have the highest net income, followed by Automobile Manufacturers and Integrated Telecommunications ServicesCopper, Oil & Gas Exploration & Production and Oil & Gas Equipment & Services are sub-groups that witnessed net lossesMaterials sector to which it belongs has a positive net income, signifying good performance of other sub-groups in the materials sector# Creating lineplot of Earnings Per Share and industry sector
plt.figure(figsize=(12, 6))
sns.lineplot(data=data, x="GICS Sector", y="Earnings Per Share", palette="viridis")
plt.title("Earnings Per Share by Industry Sector", fontsize=14) # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()
# Creating lineplot of Earnings Per Share and sub-industry group
plt.figure(figsize=(28, 6))
sns.lineplot(data=data, x="GICS Sub Industry", y="Earnings Per Share", palette="viridis")
plt.title("Earnings Per Share by Sub-Industry Group", fontsize=16) # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()
Health Care and Consumer Discretionary sectors have the highest EPS, i.e., the stocks of companies in this sector are profitableEnergy sector exhibited negative EPS, in line with the negative net income as noted above# P/E ratio and industry sector
pes = data.groupby(["GICS Sector"])["P/E Ratio"].mean().sort_values(ascending=False)
pes
GICS Sector Energy 72.898 Information Technology 43.783 Real Estate 43.066 Health Care 41.135 Consumer Discretionary 35.212 Consumer Staples 25.521 Materials 24.585 Utilities 18.719 Industrials 18.259 Financials 16.023 Telecommunications Services 12.223 Name: P/E Ratio, dtype: float64
# Creating barplot of P/E ratio and industry sector
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x="GICS Sector", y="P/E Ratio")
plt.title("P/E Ratio by Industry Sector", fontsize=14) # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()
# P/E ratio and industry sector
pesb = data.groupby(["GICS Sub Industry"])["P/E Ratio"].mean().sort_values(ascending=False)
pesb
GICS Sub Industry Internet & Direct Marketing Retail 159.111 Office REITs 110.765 Oil & Gas Exploration & Production 93.518 Oil & Gas Equipment & Services 93.089 Application Software 74.556 Biotechnology 73.319 Home Furnishings 73.124 Internet Software & Services 61.536 Specialized REITs 57.526 Brewers 48.412 Oil & Gas Refining & Marketing & Transportation 44.889 Construction Materials 44.450 Health Care Equipment 43.766 Semiconductors 43.431 Retail REITs 42.248 Gold 41.837 Health Care Distributors 41.387 Multi-line Insurance 38.721 REITs 38.444 Residential REITs 37.783 Pharmaceuticals 36.971 Steel 36.306 Casinos & Gaming 35.850 Metal & Glass Containers 35.478 Household Products 34.298 Health Care Supplies 33.994 Housewares & Specialties 33.908 Data Processing & Outsourced Services 33.620 Environmental Services 32.151 Home Entertainment Software 31.992 Investment Banking & Brokerage 31.940 Building Products 29.978 Tires & Rubber 28.658 Integrated Oil & Gas 28.408 Specialty Retail 28.218 Research & Consulting Services 27.041 Life Sciences Tools & Services 26.598 Computer Hardware 25.310 Technology Hardware, Storage & Peripherals 25.310 Soft Drinks 25.268 Specialty Chemicals 24.578 Restaurants 24.518 Financial Exchanges & Data 24.420 Personal Products 24.070 Multi-Sector Holdings 23.500 Health Care Facilities 23.150 Hotels, Resorts & Cruise Lines 23.088 Copper 22.812 Water Utilities 22.462 Packaged Foods & Meats 22.253 Paper Packaging 22.059 Leisure Products 21.908 Electrical Components & Equipment 21.785 Real Estate Services 21.085 Drug Retail 20.981 Tobacco 20.845 Distributors 20.820 Cable & Satellite 20.820 Apparel, Accessories & Luxury Goods 20.820 Specialty Stores 20.281 Trucking 19.881 Electronic Components 19.534 IT Consulting & Other Services 19.333 Advertising 19.026 Industrial Gases 18.998 Electric Utilities 18.996 Insurance Brokers 18.968 Managed Health Care 18.888 Thrifts & Mortgage Finance 18.779 Industrial Conglomerates 18.370 MultiUtilities 18.077 Air Freight & Logistics 18.047 Regional Banks 17.974 Aerospace & Defense 17.871 Human Resource & Employment Services 17.331 Industrial Materials 17.330 Networking Equipment 17.037 Diversified Commercial Services 16.568 Semiconductor Equipment 16.522 Broadcasting & Cable TV 16.513 Auto Parts & Equipment 16.385 Diversified Chemicals 16.263 Electronic Equipment & Instruments 16.225 Diversified Financial Services 15.553 Consumer Electronics 15.552 Asset Management & Custody Banks 15.478 Property & Casualty Insurance 15.313 Industrial Machinery 15.156 Railroads 15.147 Construction & Farm Machinery & Heavy Trucks 14.842 Household Appliances 14.761 Integrated Telecommunications Services 13.879 Banks 13.097 Homebuilding 12.776 Publishing 12.510 Agricultural Products 12.268 Motorcycle Manufacturers 12.235 Fertilizers & Agricultural Chemicals 11.815 Life & Health Insurance 11.407 Technology, Hardware, Software and Supplies 10.123 Consumer Finance 9.305 Airlines 8.161 Automobile Manufacturers 6.571 Alternative Carriers 5.598 Name: P/E Ratio, dtype: float64
# Creating barplot of P/E ratio and sub-industry group
plt.figure(figsize=(28, 6))
sns.barplot(data=data, x="GICS Sub Industry", y="P/E Ratio", ci=None)
plt.title("P/E Ratio by Sub-Industry Group", fontsize=16) # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()
Energy sector have the highest P/E ratio and also the highest variance in P/E ratios, as indicated by the whiskers of the boxplot. P/E ratio is a measure of valuationTelecommunication Services and Financials are the lowest, suggesting lower valuationInternet & Direct Marketing Retail (belonging to Consumer Discretionary sector) have the highest P/E ratioOffice REITs, Oil & Gas Exploration & Production and Oil & Gas Equipment & Services sub-industry groups, again indicating investor willingness to invest in shares of these sub-groups than in shares of other sub-groups# Creating barplot of P/B ratio and industry sector
plt.figure(figsize=(12, 6))
sns.barplot(data=data, x="GICS Sector", y="P/B Ratio", palette="viridis", ci=None)
plt.title("P/B Ratio by Industry Sector", fontsize=14) # Setting plot title
plt.xlabel("GICS Sector", fontsize=12)
plt.xticks(rotation=90)
plt.show()
# Creating barplot of P/B ratio and sub-industry group
plt.figure(figsize=(28, 6))
sns.barplot(data=data, x="GICS Sub Industry", y="P/B Ratio", palette="viridis", ci=None)
plt.title("P/B Ratio by Sub-Industry Group", fontsize=16) # Setting plot title
plt.xlabel("GICS Sub Industry", fontsize=14)
plt.xticks(rotation=90)
plt.show()
Telecommunications Services, Consumer Discretionary, Consumer Staples and Financials exhibited a negative P/B ratio, i.e., a negative ratio of the company's stock price per share by its book value per share, which indicates undervalued stockInformation Technology, Energy and Materials have a positive P/B ratio, suggesting that the stock price is trading at a premium to the company's book value.Data Processing & Outsourced Services sub-group had the highest P/B ratio, while Cable & Satellite, Financial Exchanges & Data and Household Appliances have negative P/B ratios.sns.pairplot(data=data[num_col], diag_kind="kde")
plt.show()
plt.figure(figsize=(15, 7))
sns.heatmap(data[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="viridis_r")
plt.show()
P/E Ratio and Earnings Per Share appear to be bimodal and that of Volatility is trimodalEarnings Per Share has a positive correlation with Current Price (0.48) & Net Income (0.56)Earnings Per Share has a negative correlation with Volatility (-0.38) & ROE (-0.41)Net Income has a positive correlation with Estimated Shares Outstanding (0.59)Volatility has a negative correlation with Price Change (-0.41), signifying that the more volatile the stocks, the more negative the Price Change# checking for duplicate values
data.duplicated().sum()
0
# checking for missing values in the data
data.isnull().sum()
Ticker Symbol 0 Security 0 GICS Sector 0 GICS Sub Industry 0 Current Price 0 Price Change 0 Volatility 0 ROE 0 Cash Ratio 0 Net Cash Flow 0 Net Income 0 Earnings Per Share 0 Estimated Shares Outstanding 0 P/E Ratio 0 P/B Ratio 0 dtype: int64
no duplicated or missing values in the dataset.plt.figure(figsize=(15, 12))
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# Scaling the data set before clustering
scaler = StandardScaler()
subset = data[num_col].copy()
subset_scaled = scaler.fit_transform(subset)
# creating a dataframe of the scaled columns
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
# Checking for scaled values
subset_scaled_df.head(10)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393 | 0.494 | 0.273 | 0.990 | -0.211 | -0.339 | 1.554 | 1.309 | 0.108 | -0.652 | -0.507 |
| 1 | -0.221 | 0.355 | 1.137 | 0.938 | 0.077 | -0.002 | 0.928 | 0.057 | 1.250 | -0.312 | -0.504 |
| 2 | -0.367 | 0.602 | -0.427 | -0.193 | -0.033 | 0.454 | 0.744 | 0.025 | 1.098 | -0.392 | 0.095 |
| 3 | 0.134 | 0.826 | -0.285 | -0.317 | 1.218 | -0.152 | -0.220 | -0.231 | -0.092 | 0.947 | 0.424 |
| 4 | -0.261 | -0.493 | 0.296 | -0.266 | 2.237 | 0.134 | -0.203 | -0.375 | 1.978 | 3.293 | 0.199 |
| 5 | -0.451 | -1.343 | -0.016 | -0.307 | -0.233 | -0.126 | 0.090 | 0.032 | 0.049 | -0.459 | 0.661 |
| 6 | 1.999 | 0.176 | -0.692 | -0.100 | -0.499 | 0.018 | -0.228 | 0.932 | -0.604 | -0.036 | 9.378 |
| 7 | -0.384 | -0.159 | -0.680 | -0.317 | -0.620 | 0.119 | -0.218 | -0.027 | -0.394 | -0.361 | 0.072 |
| 8 | -0.231 | -0.142 | -0.774 | -0.297 | -0.676 | -0.021 | 0.142 | 0.054 | -0.184 | -0.320 | -0.094 |
| 9 | -0.214 | -0.088 | -0.808 | -0.266 | 0.321 | -0.187 | 0.264 | 0.472 | -0.173 | -0.506 | -0.012 |
Ticker Symbol and Security for analysis, so let's drop these columns¶data.drop(["Ticker Symbol"], axis=1, inplace=True)
data.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Security | 340 | 340 | American Airlines Group | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current Price | 340.000 | NaN | NaN | NaN | 80.862 | 98.055 | 4.500 | 38.555 | 59.705 | 92.880 | 1274.950 |
| Price Change | 340.000 | NaN | NaN | NaN | 4.078 | 12.006 | -47.130 | -0.939 | 4.820 | 10.695 | 55.052 |
| Volatility | 340.000 | NaN | NaN | NaN | 1.526 | 0.592 | 0.733 | 1.135 | 1.386 | 1.696 | 4.580 |
| ROE | 340.000 | NaN | NaN | NaN | 39.597 | 96.548 | 1.000 | 9.750 | 15.000 | 27.000 | 917.000 |
| Cash Ratio | 340.000 | NaN | NaN | NaN | 70.024 | 90.421 | 0.000 | 18.000 | 47.000 | 99.000 | 958.000 |
| Net Cash Flow | 340.000 | NaN | NaN | NaN | 55537620.588 | 1946365312.176 | -11208000000.000 | -193906500.000 | 2098000.000 | 169810750.000 | 20764000000.000 |
| Net Income | 340.000 | NaN | NaN | NaN | 1494384602.941 | 3940150279.328 | -23528000000.000 | 352301250.000 | 707336000.000 | 1899000000.000 | 24442000000.000 |
| Earnings Per Share | 340.000 | NaN | NaN | NaN | 2.777 | 6.588 | -61.200 | 1.558 | 2.895 | 4.620 | 50.090 |
| Estimated Shares Outstanding | 340.000 | NaN | NaN | NaN | 577028337.754 | 845849595.418 | 27672156.860 | 158848216.100 | 309675137.800 | 573117457.325 | 6159292035.000 |
| P/E Ratio | 340.000 | NaN | NaN | NaN | 32.613 | 44.349 | 2.935 | 15.045 | 20.820 | 31.765 | 528.039 |
| P/B Ratio | 340.000 | NaN | NaN | NaN | -1.718 | 13.967 | -76.119 | -4.352 | -1.067 | 3.917 | 129.065 |
k_means_df = subset_scaled_df.copy()
%%time
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k, random_state=1)
model.fit(subset_scaled_df)
prediction = model.predict(k_means_df)
distortion = (
sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
/ k_means_df.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Number of Clusters: 1 Average Distortion: 2.5425069919221697 Number of Clusters: 2 Average Distortion: 2.382318498894466 Number of Clusters: 3 Average Distortion: 2.2692367155390745 Number of Clusters: 4 Average Distortion: 2.1745559827866363 Number of Clusters: 5 Average Distortion: 2.128799332840716 Number of Clusters: 6 Average Distortion: 2.080400099226289 Number of Clusters: 7 Average Distortion: 2.0289794220177395 Number of Clusters: 8 Average Distortion: 1.964144163389972 Number of Clusters: 9 Average Distortion: 1.9221492045198068 Number of Clusters: 10 Average Distortion: 1.8513913649973124 Number of Clusters: 11 Average Distortion: 1.8024134734578485 Number of Clusters: 12 Average Distortion: 1.7900931879652673 Number of Clusters: 13 Average Distortion: 1.7417609203336912 Number of Clusters: 14 Average Distortion: 1.673559857259703
CPU times: user 1.01 s, sys: 672 ms, total: 1.68 s Wall time: 463 ms
# checking silhoutte score
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
preds = clusterer.fit_predict((k_means_df))
score = silhouette_score(k_means_df, preds)
sil_score.append(score)
print("For n_clusters = {}, the silhouette score is {})".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.43969639509980457) For n_clusters = 3, the silhouette score is 0.4644405674779404) For n_clusters = 4, the silhouette score is 0.4577225970476733) For n_clusters = 5, the silhouette score is 0.43228336443659804) For n_clusters = 6, the silhouette score is 0.4005422737213617) For n_clusters = 7, the silhouette score is 0.3976335364987305) For n_clusters = 8, the silhouette score is 0.40278401969450467) For n_clusters = 9, the silhouette score is 0.3778585981433699) For n_clusters = 10, the silhouette score is 0.13458938329968687) For n_clusters = 11, the silhouette score is 0.1421832155528444) For n_clusters = 12, the silhouette score is 0.2044669621527429) For n_clusters = 13, the silhouette score is 0.23424874810104204) For n_clusters = 14, the silhouette score is 0.12102526472829901)
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
%%time
# final K-means model with k=5
kmeans = KMeans(n_clusters=5, random_state=1)
kmeans.fit(k_means_df)
CPU times: user 116 ms, sys: 499 ms, total: 615 ms Wall time: 71.9 ms
KMeans(n_clusters=5, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KMeans(n_clusters=5, random_state=1)
# creating a copy of the original data
df1 = data.copy()
# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["K_means_segments"] = kmeans.labels_
df1["K_means_segments"] = kmeans.labels_
km_cluster_profile = df1.groupby("K_means_segments").mean()
km_cluster_profile["count_in_each_segment"] = (
df1.groupby("K_means_segments")["Security"].count().values)
km_cluster_profile.style.highlight_max(color="yellow", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments | ||||||||||||
| 0 | 65.174668 | -11.542247 | 2.690220 | 37.300000 | 65.366667 | 195008366.666667 | -1677736033.333333 | -4.401667 | 544473664.718000 | 113.488924 | 1.424161 | 30 |
| 1 | 72.738269 | 5.179897 | 1.380738 | 34.825455 | 53.138182 | -10147287.272727 | 1488641570.909091 | 3.636164 | 437961614.918582 | 23.680917 | -3.395254 | 275 |
| 2 | 233.251108 | 13.682869 | 1.719008 | 29.333333 | 296.523810 | 1398716380.952381 | 1835686380.952381 | 7.126190 | 508721791.962857 | 37.805996 | 16.758218 | 21 |
| 3 | 50.517273 | 5.747586 | 1.130399 | 31.090909 | 75.909091 | -1072272727.272727 | 14833090909.090910 | 4.154545 | 4298826628.727273 | 14.803577 | -4.552119 | 11 |
| 4 | 26.990000 | -14.060688 | 3.296307 | 603.000000 | 57.333333 | -585000000.000000 | -17555666666.666668 | -39.726667 | 481910081.666667 | 71.528835 | 1.638633 | 3 |
# Printing the companies in each cluster
for cl in df1["K_means_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df1[df1["K_means_segments"] == cl]["Security"].unique())
print()
In cluster 1, the following companies are present: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'General Dynamics' 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kimberly-Clark' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] In cluster 0, the following companies are present: ['Analog Devices, Inc.' 'Alexion Pharmaceuticals' 'Amazon.com Inc' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas' 'Concho Resources' 'EOG Resources' 'EQT Corporation' 'Freeport-McMoran Cp & Gld' 'Halliburton Co.' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kinder Morgan' 'The Mosaic Company' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.' 'Spectra Energy Corp.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] In cluster 2, the following companies are present: ['Alliance Data Systems' 'Amgen Inc' 'Bank of America Corp' 'BIOGEN IDEC Inc.' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences' 'Facebook' 'First Solar Inc' 'Frontier Communications' 'Intuitive Surgical Inc.' "McDonald's Corp." 'Monster Beverage' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor' 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd' 'Yahoo Inc.'] In cluster 4, the following companies are present: ['Apache Corporation' 'Chesapeake Energy' 'Devon Energy Corp.'] In cluster 3, the following companies are present: ['Citigroup Inc.' 'Ford Motor' 'Gilead Sciences' 'Intel Corp.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']
df1.groupby(["K_means_segments", "GICS Sector"])['Security'].count()
K_means_segments GICS Sector
0 Consumer Discretionary 1
Energy 21
Health Care 1
Industrials 1
Information Technology 4
Materials 2
1 Consumer Discretionary 33
Consumer Staples 17
Energy 5
Financials 45
Health Care 29
Industrials 52
Information Technology 24
Materials 18
Real Estate 26
Telecommunications Services 2
Utilities 24
2 Consumer Discretionary 5
Consumer Staples 1
Financials 1
Health Care 8
Information Technology 4
Real Estate 1
Telecommunications Services 1
3 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 2
Information Technology 1
Telecommunications Services 2
4 Energy 3
Name: Security, dtype: int64
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=k_means_df, x="K_means_segments", y=variable)
plt.tight_layout(pad=2.0)
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of original variables for each cluster")
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df1, x="K_means_segments", y=variable)
plt.tight_layout(pad=2.0)
Cluster 0 has 30 companies/securities, Cluster 1 has 275 companies, Cluster 2 has 21 companies, Cluster 3 has 11 companies, while Cluster 4 has 3 companiesEnergy sectorCurrent Price, the stocks in this cluster were priced under 200 dollars, with a mean price of ~ 65 dollarsnegative price changes, with moderate to high volatility (average volatility is ~ 2.7)low ROE (average ~ 37) and low cash ratio (average ~ 65)positive average net cash flow, suggesting cash inflows > cash outflows. However, the average net income as well as the EPS are negativeP/E ratio is high (average ~ 113), which indicates the willingness of investors to invest in stocks in this clusterP/B ratio is low, suggesting possible undervalued stocksIndustrials, Financials and Consumer Discretionary sectorsCurrent Price, the majority of stocks in this cluster were priced under 200 dollars, with a mean price of ~ 73 dollarspositive price changes, with low volatility (average volatility is ~ 1.3)low ROE (average ~ 35) and low cash ratio (average ~ 53)negative average net cash flow, suggesting cash outflows > cash inflows. However, the average net income as well as the EPS are positiveP/E ratio is low but positive (~ 24), while the average P/B ratio is negative, suggesting undervalued stocksHealth Care, Consumer Discretionary and Information Technology sectorshighest current stock prices, with an average current price of ~ 233 dollarsmaximum positive price changes as compared to other clusters (average ~ 14%), with low volatility (average ~1.7)low ROE (average ~ 29) but the highest cash ratio (average ~ 296) among all clusters, indicating the ability of companies to cover their short-term obligations using only cash and cash equivalentshigh and positive average net cash flow, suggesting cash inflows > cash outflows. Also, the average net income is positivehighest average EPS (~ 7)P/E ratio is low but positive and greater than cluster 1 (~ 38)P/B ratio is the highest of all clusters (~ 17), suggesting that the stock price for stocks in this cluster is trading at a premium to the company's book valueFinancials, Health Care and Telecommunications Services sectorsCurrent Price, the majority of stocks in this cluster were priced under 100 dollars, with a mean price of ~ 50 dollarspositive price changes, with low volatility (average volatility is ~ 1.1)low ROE (average ~ 31) and an average cash ratio of ~ 75, which is second best after cluster 2negative average net cash flow, suggesting cash outflows > cash inflowsnet income is the highest of all clusters as well as the EPS are positive (~ 4)P/E ratio is low but positive (~ 15), while the average P/B ratio is negative (~ -4.5), suggesting undervalued stocksEnergy sectorlowest average current price (~ 27 dollars) and negative average price changes (~ 14) and high average volatility (~ 3.3), as was also noted in EDAhighest average ROE (~ 603)net cash flow, net income and earnings per share are negativeP/E ratio is high for this cluster (energy sector) (~ 72), and the average P/B is positive (~ 1.6), indicative of no undervaluation in stock price, hence being attractive investment opportunities for investorshc_df = subset_scaled_df.copy()
%%time
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(hc_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922. Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672. Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814. Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404. Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717. Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242. Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499. Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367. Cophenetic correlation for Mahalanobis distance and single linkage is 0.9259195530524591. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850004. Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736. Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428. Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574. Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818. Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667. Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281. CPU times: user 208 ms, sys: 80.8 ms, total: 289 ms Wall time: 45.3 ms
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print("Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]))
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
%%time
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(hc_df, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9232271494002922. Cophenetic correlation for complete linkage is 0.7873280186580672. Cophenetic correlation for average linkage is 0.9422540609560814. Cophenetic correlation for centroid linkage is 0.9314012446828154. Cophenetic correlation for ward linkage is 0.7101180299865353. Cophenetic correlation for weighted linkage is 0.8693784298129404. CPU times: user 109 ms, sys: 84.2 ms, total: 194 ms Wall time: 22.4 ms
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print("Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(high_cophenet_corr, high_dm_lm[1]))
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
%%time
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(hc_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction")
compare.append([method, coph_corr])
CPU times: user 1.01 s, sys: 1.24 s, total: 2.25 s Wall time: 872 ms
# let's create a dataframe to compare cophenetic correlations for each linkage method
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc
| Linkage | Cophenetic Coefficient | |
|---|---|---|
| 0 | single | 0.923 |
| 1 | complete | 0.787 |
| 2 | average | 0.942 |
| 3 | centroid | 0.931 |
| 4 | ward | 0.710 |
| 5 | weighted | 0.869 |
%%time
HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="average")
HCmodel.fit(hc_df)
CPU times: user 2.22 ms, sys: 864 µs, total: 3.09 ms Wall time: 2.26 ms
AgglomerativeClustering(linkage='average', n_clusters=5)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AgglomerativeClustering(linkage='average', n_clusters=5)
# creating a copy of the original data
df2 = data.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_
hc_cluster_profile = df2.groupby("HC_segments").mean()
hc_cluster_profile["count_in_each_segment"] = (
df2.groupby("HC_segments")["Security"].count().values)
hc_cluster_profile.style.highlight_max(color="yellow", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | ||||||||||||
| 0 | 77.884243 | 4.105986 | 1.516865 | 35.320359 | 66.775449 | -32825817.365269 | 1535255703.592814 | 2.903308 | 559027333.145509 | 32.437511 | -1.781988 | 334 |
| 1 | 25.640000 | 11.237908 | 1.322355 | 12.500000 | 130.500000 | 16755500000.000000 | 13654000000.000000 | 3.295000 | 2791829362.100000 | 13.649696 | 1.508484 | 2 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1292500000.000000 | -19106500000.000000 | -41.815000 | 519573983.250000 | 60.748608 | 1.565141 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 592000000.000000 | 3669000000.000000 | 1.310000 | 2800763359.000000 | 79.893133 | 5.884467 | 1 |
| 4 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1671386000.000000 | 2551360000.000000 | 50.090000 | 50935516.070000 | 25.453183 | -1.052429 | 1 |
## Complete the code to print the companies in each cluster
for cl in df2["HC_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df2[df2["HC_segments"] == cl]["Security"].unique())
print()
In cluster 0, the following companies are present: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co' 'Alliance Data Systems' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Alexion Pharmaceuticals' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Amgen Inc' 'Ameriprise Financial' 'American Tower Corp A' 'Amazon.com Inc' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Anadarko Petroleum Corp' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'Baker Hughes Inc' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Citigroup Inc.' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.' 'CME Group Inc.' 'Chipotle Mexican Grill' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'Cabot Oil & Gas' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Concho Resources' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'Devon Energy Corp.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'EOG Resources' 'Equinix' 'Equity Residential' 'EQT Corporation' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Ford Motor' 'Fastenal Co' 'Fortune Brands Home & Security' 'Freeport-McMoran Cp & Gld' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'First Solar Inc' 'Frontier Communications' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hess Corporation' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Intuitive Surgical Inc.' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'JPMorgan Chase & Co.' 'Kimco Realty' 'Kimberly-Clark' 'Kinder Morgan' 'Coca Cola Company' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Monster Beverage' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'Marathon Oil Corp.' 'M&T Bank Corp.' 'Mettler Toledo' 'Murphy Oil' 'Mylan N.V.' 'Navient' 'Noble Energy Inc' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Netflix Inc.' 'Newfield Exploration Co' 'Nielsen Holdings' 'National Oilwell Varco Inc.' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'ONEOK' 'Omnicom Group' "O'Reilly Automotive" 'Occidental Petroleum' "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Pfizer Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Regeneron' 'Robert Half International' 'Roper Industries' 'Range Resources Corp.' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Southwestern Energy' 'Synchrony Financial' 'Stryker Corp.' 'AT&T Inc' 'Molson Coors Brewing Company' 'Teradata Corp.' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc' 'Ventas Inc' 'Verizon Communications' 'Waters Corporation' 'Wec Energy Group Inc' 'Wells Fargo' 'Whirlpool Corp.' 'Waste Management Inc.' 'Williams Cos.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd' 'Cimarex Energy' 'Xcel Energy Inc' 'XL Capital' 'Exxon Mobil Corp.' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yahoo Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] In cluster 2, the following companies are present: ['Apache Corporation' 'Chesapeake Energy'] In cluster 1, the following companies are present: ['Bank of America Corp' 'Intel Corp.'] In cluster 3, the following companies are present: ['Facebook'] In cluster 4, the following companies are present: ['Priceline.com Inc']
df2.groupby(["HC_segments", "GICS Sector"])['Security'].count()
HC_segments GICS Sector
0 Consumer Discretionary 39
Consumer Staples 19
Energy 28
Financials 48
Health Care 40
Industrials 53
Information Technology 31
Materials 20
Real Estate 27
Telecommunications Services 5
Utilities 24
1 Financials 1
Information Technology 1
2 Energy 2
3 Information Technology 1
4 Consumer Discretionary 1
Name: Security, dtype: int64
Average Linkage method does not result in representative and distributed clusters, as Cluster 0 comprises 334 securities, Cluster 1 & 2 consist of 2 companies each, while Clusters 3 & 4 have only 1 security eachWard Linkage method has more distributed clusters and hence, let's move forward with this method%%time
HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="ward")
HCmodel.fit(hc_df)
CPU times: user 2.6 ms, sys: 744 µs, total: 3.34 ms Wall time: 2.61 ms
AgglomerativeClustering(n_clusters=5)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AgglomerativeClustering(n_clusters=5)
# creating a copy of the original data
df3 = data.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_segments"] = HCmodel.labels_
df3["HC_segments"] = HCmodel.labels_
hc_cluster_profile = df3.groupby("HC_segments").mean()
hc_cluster_profile["count_in_each_segment"] = (
df3.groupby("HC_segments")["Security"].count().values)
hc_cluster_profile.style.highlight_max(color="yellow", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | ||||||||||||
| 0 | 213.518640 | 15.252913 | 1.779861 | 22.333333 | 258.740741 | 1504052814.814815 | 1716529851.851852 | 5.177407 | 689838338.441482 | 78.441603 | 13.022590 | 27 |
| 1 | 84.355716 | 3.854981 | 1.827670 | 633.571429 | 33.571429 | -568400000.000000 | -4968157142.857142 | -10.841429 | 398169036.442857 | 42.284541 | -11.589502 | 7 |
| 2 | 46.672222 | 5.166566 | 1.079367 | 25.000000 | 58.333333 | -3040666666.666667 | 14848444444.444445 | 3.435556 | 4564959946.222222 | 15.596051 | -6.354193 | 9 |
| 3 | 72.421687 | 4.563230 | 1.403434 | 25.218182 | 55.014545 | 72801872.727273 | 1572467469.090909 | 3.728564 | 445003946.148764 | 24.188244 | -2.966949 | 275 |
| 4 | 36.440455 | -16.073408 | 2.832884 | 57.500000 | 42.409091 | -472834090.909091 | -3161045227.272727 | -8.005000 | 514367806.201818 | 85.555682 | 0.836839 | 22 |
## Complete the code to print the companies in each cluster
for cl in df3["HC_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df3[df3["HC_segments"] == cl]["Security"].unique())
print()
In cluster 3, the following companies are present: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'EQT Corporation' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] In cluster 0, the following companies are present: ['Adobe Systems Inc' 'Alliance Data Systems' 'Alexion Pharmaceuticals' 'Applied Materials Inc' 'Amgen Inc' 'Amazon.com Inc' 'Broadcom' 'Bank of America Corp' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences' 'Facebook' 'First Solar Inc' 'Frontier Communications' 'Intel Corp.' 'Intuitive Surgical Inc.' 'Monster Beverage' 'Newmont Mining Corp. (Hldg. Co.)' 'Netflix Inc.' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor' 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd' 'Yahoo Inc.'] In cluster 1, the following companies are present: ['Allegion' 'Apache Corporation' 'Chesapeake Energy' 'Charter Communications' 'Colgate-Palmolive' 'Kimberly-Clark' 'S&P Global, Inc.'] In cluster 4, the following companies are present: ['Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Range Resources Corp.' 'Spectra Energy Corp.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] In cluster 2, the following companies are present: ['Citigroup Inc.' 'Ford Motor' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']
df3.groupby(["HC_segments", "GICS Sector"])['Security'].count()
HC_segments GICS Sector
0 Consumer Discretionary 5
Consumer Staples 1
Financials 1
Health Care 8
Information Technology 9
Materials 1
Real Estate 1
Telecommunications Services 1
1 Consumer Discretionary 1
Consumer Staples 2
Energy 2
Financials 1
Industrials 1
2 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 1
Telecommunications Services 2
3 Consumer Discretionary 33
Consumer Staples 15
Energy 7
Financials 44
Health Care 31
Industrials 52
Information Technology 23
Materials 18
Real Estate 26
Telecommunications Services 2
Utilities 24
4 Energy 20
Information Technology 1
Materials 1
Name: Security, dtype: int64
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=hc_df, x="HC_segments", y=variable)
plt.tight_layout(pad=2.0)
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of original numerical variables for each cluster")
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df3, x="HC_segments", y=variable)
plt.tight_layout(pad=2.0)
Cluster 0 has 27 companies/securities, Cluster 1 has 7 companies, Cluster 2 has 9 companies, Cluster 3 has 275 companies, while Cluster 4 has 22 companiesInformation Technology, Health Care, Consumer Discretionary sectorshighest average current price (~ 214), with high positive price changes (average ~ 15%) and low average volatility (~ 1.8)low ROE (average ~ 22), but the highest cash ratio (average ~ 259) among all clusters, indicating the ability of companies to cover their short-term obligations using only cash and cash equivalentspositive average net cash flow, suggesting cash inflows > cash outflows. Also, the average net income as well as the EPS are positiveP/E ratio is high (average ~ 78), which indicates the willingness of investors to invest in stocks in this clusterhighest average P/B ratio, suggesting that the stock price for stocks in this cluster is trading at a premium to the company's book valueEnergy and Comsumer staples sectorsCurrent Price, the majority of stocks in this cluster were priced under 200 dollars, with a mean price of ~ 84 dollarspositive price changes, with low volatility (average volatility is ~ 1.8)highest average ROE (average ~ 634) and low cash ratio (average ~ 33)negative average net cash flow, suggesting cash outflows > cash inflows. Further, the average net income as well as the EPS are also negativeP/E ratio is moderate (~ 42) as compared to other clusters, while the average P/B ratio is negative, suggesting undervalued stocksFinancials and Telecommunications Services sectorsCurrent Price, the stocks in this cluster were priced low (average ~ 47 dollars), with positive average price changes (~ 5%) and low volatility (~ 1)low ROE (average ~ 25) and an average cash ratio of ~ 58negative average net cash flow, suggesting cash outflows > cash inflowshighest positive average net income, suggesting revenues other than cashlow average EPS (~ 3)P/E ratio is positive but the lowest of all clusters (~16), while the average P/B ratio is negative, suggesting undervalued stocksIndustrials, Financials, Consumer Discretionary and Health Care sectorsCurrent Price, the majority of stocks in this cluster were priced under 100 dollars, with a mean price of ~ 72 dollarspositive price changes, with low volatility (average volatility is ~ 1.4)low ROE (average ~ 25) and an average cash ratio of ~ 55, indicating low liquidityaverage net cash flow, net income and EPS are positiveP/E ratio is low but positive (~ 24), while the average P/B ratio is negative (~ -2.9), suggesting undervalued stocksEnergy sectorlowest average current price (~ 36), with negative price changes (average ~ -16%) and highest average volatility (~ 2.8) among all clusterssecond best average ROE after cluster 1 (~ 57) and low cash ratio (average ~ 42)net cash flow, net income and earnings per share are negativeP/E ratio is the highest for this cluster (energy sector) (~ 85), and the average P/B is low but positive (~ 1), indicative of no undervaluation in stock price, hence being attractive investment opportunities for investorsLet's use PCA to reduce the data to two dimensions and visualize it to see how well-separated the clusters are
# importing library
from sklearn.decomposition import PCA
# setting the number of components to 2
pca = PCA(n_components=2)
# transforming data and storing results in a dataframe
X_reduced_pca = pca.fit_transform(subset_scaled_df)
reduced_df_pca = pd.DataFrame(
data=X_reduced_pca, columns=["Component 1", "Component 2"])
# checking the amount of variance explained
pca.explained_variance_ratio_.sum()
0.3712899567725857
sns.scatterplot(data=reduced_df_pca, x="Component 1", y="Component 2")
<AxesSubplot:xlabel='Component 1', ylabel='Component 2'>
sns.scatterplot(
data=reduced_df_pca,
x="Component 1",
y="Component 2",
hue=df3["HC_segments"],
palette="rainbow",
)
plt.legend(bbox_to_anchor=(1, 1))
<matplotlib.legend.Legend at 0x2a79597f0>
quick and were able to fit the model within ~1 secondsimilar clusters with no recognizable differences between them275 securities/companies5 clusters were taken as appropriate to ensure sufficient representation of all securitiesBased on the Exploratory Data Analysisconducted and the Clustering techniques used, the following insights and recommendations can be made:
Both clustering techniques yielded 5 clusters, where majority of securities were essentially clustered in a similar manner.
One cluster (with 3 securities as per k-means and 22 securities as per hierarchical clustering) majorly comprised of companies from Energy Sector.
risk-seeking investors can certainly look towards investing in securities from this cluster.Second cluster (with 275 securities each as per k-means and hierarchical clustering) was the most diversified portfolio of stocks, which allows investors to spread risk
Industrials, Financials, Consumer Discretionary and Health Care sectors.hedge risk can go forward with investing in securities from this cluster.The third cluster (with 21 securities as per k-means and 27 securities as per hierarchical clustering) majorly comprised of the Health Care, Consumer Discretionary and Information Technology sectors.
moderate risk and have the funds to invest, should go ahead with investing in securities from this cluster.The next cluster (with 11 securities as per k-means and 9 securities as per hierarchical clustering) majorly consisted of stocks from Financials, Telecommunications Services and. Health Care sectors.
afforable investment option with high revenue and moderate risk can opt for securities from this cluster.Clearly, CHANGE is the only constant in the stock market.
Also, while comparing the metric for a specific industry, an investor must be well aware of its ideal ratio values for differnt variables, e.g., Earnings Per Share, P/E ratio, P/B ratio, etc, as these variables are industry specific.
Lastly, while k-means and hierarchical clustering algorithms have yielded representative clusters, stock movements change daily and, therefore, Dynamic clustering would prove useful for further analysis of such data